{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "111dcc20",
   "metadata": {},
   "source": [
    "### 2021.05.08 数据预处理学习\n",
    "* os文件/目录方法模块学习"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c739a53f",
   "metadata": {},
   "source": [
    "举一个例子，我们首先创建一个人工数据集，并存储在csv（逗号分隔值）文件 ../data/house_tiny.csv 中。以其他格式存储的数据也可以通过类似的方式进行处理。下面的mkdir_if_not_exist 函数可确保目录 ../data 存在。注意，注释 #@save是一个特殊的标记，该标记下方的函数、类或语句将保存在 d2l 软件包中，以便以后可以直接调用它们（例如 d2l.mkdir_if_not_exist(path)）而无需重新定义。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "854ff181",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.makedirs(os.path.join('..', 'data'), exist_ok=True)# os.makedirs() 方法用于递归创建目录\n",
    "data_file = os.path.join('..', 'data', 'house_tiny.csv')\n",
    "with open(data_file, 'w') as f:\n",
    "    f.write('NumRooms,Alley,Price\\n')  # 列名\n",
    "    f.write('NA,Pave,127500\\n')  # 每行表示一个数据样本\n",
    "    f.write('2,NA,106000\\n')\n",
    "    f.write('4,NA,178100\\n')\n",
    "    f.write('NA,NA,140000\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f4c84df9",
   "metadata": {},
   "source": [
    "## Note：\n",
    "#### os.makedirs() 方法用于递归创建目录\n",
    "* makedirs()方法语法格式如下：\n",
    " * os.makedirs(path, mode=0o777)\n",
    "* 参数\n",
    " * path -- 需要递归创建的目录，可以是相对或者绝对路径。\n",
    " * mode -- 权限模式。\n",
    " \n",
    "#### os.path.join()函数用于路径拼接文件路径\n",
    "* 路径表示\n",
    " * . 表示当前目录\n",
    " * .. 表示当前目录的上一级目录。\n",
    " * ./表示当前目录下的某个文件或文件夹，视后面跟着的名字而定\n",
    " * ../表示当前目录上一级目录的文件或文件夹，视后面跟着的名字而定。\n",
    "# \n",
    "* os.path.join('..', 'data') 表示路径..data，实际是在当前目录创建data文件夹\n",
    " * 会从第一个以”/”开头的参数开始拼接，之前的参数全部丢弃。\n",
    " * 以上一种情况为先。在上一种情况确保情况下，若出现”./”开头的参数，会从”./”开头的参数的上一个参数开始拼接。\n",
    " * 有多个以”/”开头的参数，从最后”/”开头的的开始往后拼接，之前的参数全部丢弃\n",
    " * !!!注意：Linux下和Windows下有所区别，这是基于Windows下的结论，见评论：python路径拼接os.path.join()函数完全教程https://blog.csdn.net/weixin_37895339/article/details/79185119\n",
    " * os.path.join('..', 'data', 'house_tiny.csv') 表示当前目录的data文件夹下的house_tiny.csv文件的目录\n",
    " \n",
    "#### with open(data_file, 'w') as f: f.write()\n",
    " * 文件的写操作；\n",
    " * 相关用法见：\n",
    "  * python 使用 with open（） as 读写文件：https://blog.csdn.net/xrinosvip/article/details/82019844 ；\n",
    "  * with open() as f 用法：https://blog.csdn.net/wzhrsh/article/details/101629075 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "05b4a122",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1: /bbbb\\ccccc.txt\n",
      "2: /ccccc.txt\n",
      "3: aaaa\\./bbb\\ccccc.txt\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "# windows环境下，结果如上所述，结论正确\n",
    "print(\"1:\",os.path.join('aaaa','/bbbb','ccccc.txt'))\n",
    "print(\"2:\",os.path.join('/aaaa','/bbbb','/ccccc.txt'))\n",
    "print(\"3:\",os.path.join('aaaa','./bbb','ccccc.txt'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "977da630",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>NumRooms</th>\n",
       "      <th>Alley</th>\n",
       "      <th>Price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>NaN</td>\n",
       "      <td>Pave</td>\n",
       "      <td>127500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>106000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>178100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>140000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   NumRooms Alley   Price\n",
       "0       NaN  Pave  127500\n",
       "1       2.0   NaN  106000\n",
       "2       4.0   NaN  178100\n",
       "3       NaN   NaN  140000"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "data = pd.read_csv(data_file)\n",
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0f3eec3d",
   "metadata": {},
   "source": [
    "#### 2.2.2. 处理缺失值\n",
    "\n",
    "注意，“NaN” 项代表缺失值。为了处理缺失的数据，典型的方法包括 插值 和 删除，其中插值用替代值代替缺失值。而删除则忽略缺失值。在这里，我们将考虑插值。\n",
    "\n",
    "通过位置索引iloc，我们将 data 分成 inputs 和 outputs，其中前者为 data的前两列，而后者为 data的最后一列。对于 inputs 中缺少的的数值，我们用同一列的均值替换 “NaN” 项。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "9408683e",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   NumRooms Alley\n",
      "0       3.0  Pave\n",
      "1       2.0   NaN\n",
      "2       4.0   NaN\n",
      "3       3.0   NaN\n"
     ]
    }
   ],
   "source": [
    "# iloc[:,:]，逗号前是行，后是列，：表示从哪行（列）到哪行（列），如下面的0:2即表示0-2列\n",
    "inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2] # 第二列，即最后一列\n",
    "inputs = inputs.fillna(inputs.mean())\n",
    "print(inputs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "05410613",
   "metadata": {},
   "source": [
    "### Note：\n",
    "\n",
    "#### iloc函数：通过行号来取行数据（如取第二行的数据）\n",
    "* data.iloc[:, 0:2] 取data的所有行的0-2列\n",
    "* loc函数：通过行索引 \"Index\" 中的具体值来取行数据（如取\"Index\"为\"A\"的行）\n",
    "* 更多：Pandas中loc和iloc函数用法详解（源码+实例）：https://www.jianshu.com/p/dadf2f1b88fc\n",
    "  \n",
    "#### fillna()，mean()函数\n",
    "* fillna函数形式：fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)\n",
    "* 参数：\n",
    " * value：用于填充的空值的值。\n",
    " * 更多参数：pandas 用均值填充缺失值NaN | fillna 方法解析：https://blog.csdn.net/sinat_28442665/article/details/104901143\n",
    "\n",
    "\n",
    "*  mean()函数功能：求取均值：python 的numpy库中的mean()函数用法；https://blog.csdn.net/taotiezhengfeng/article/details/72397282\n",
    "* 经常操作的参数为axis，以m * n矩阵举例：\n",
    " * axis 不设置值，对 mn 个数求均值，返回一个实数\n",
    " * axis = 0：压缩行，对各列求均值，返回 1 n 矩阵\n",
    " * axis = 1 ：压缩列，对各行求均值，返回 m *1 矩阵\n",
    "\n",
    "\n",
    "* mean（A）\n",
    " * 若A为矩阵，则输出每一列的均值（一个向量）\n",
    " * 若A为列向量，则输出均值（一个数）\n",
    " * 若A为行向量，则也是输出均值（一个数），和列向量一样\n",
    " \n",
    " \n",
    " #### 使用jupyter notebook编辑文本和代码：https://www.jianshu.com/p/cef319cb9965\n",
    " * enter两行既是空一行\n",
    " * 按两次dd可以删除单元格"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8cf779c6",
   "metadata": {},
   "source": [
    "对于 inputs 中的类别值或离散值，我们将 “NaN” 视为一个类别。由于 “巷子”（“Alley”）列只接受两种类型的类别值 “Alley” 和 “NaN”，pandas 可以自动将此列转换为两列 “Alley_Pave” 和 “Alley_nan”。巷子类型为 “Pave” 的行会将“Alley_Pave”的值设置为1，“Alley_nan”的值设置为0。缺少巷子类型的行会将“Alley_Pave”和“Alley_nan”分别设置为0和1。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "41424763",
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   NumRooms  Alley_Pave  Alley_nan\n",
      "0       3.0           1          0\n",
      "1       2.0           0          1\n",
      "2       4.0           0          1\n",
      "3       3.0           0          1\n"
     ]
    }
   ],
   "source": [
    "inputs = pd.get_dummies(inputs, dummy_na=True)\n",
    "print(inputs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fbba4782",
   "metadata": {},
   "source": [
    "## Note\n",
    "###  pd.get_dummies，\n",
    "#### 官方文档：https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html\n",
    "\n",
    "* pd.get_dummies(inputs, dummy_na=True)\n",
    " * 默认按值分为几列，同时dummy_na=True表示用bool值表示具体值\n",
    " * 以上只有pave和NAN两种值，所以分为两列，同时pave用1表示，NAN用0表示\n",
    " \n",
    "* 对分类型变量，进行编码处理——pd.get_dummies()、LabelEncoder()、oneHotEncoder()：https://www.cnblogs.com/wyy1480/p/10295084.html"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc6dc37d",
   "metadata": {},
   "source": [
    "### 2021.05.11 \n",
    "#### 2.2.3. 转换为张量格式\n",
    "现在 inputs 和 outputs 中的所有条目都是数值类型，它们可以转换为张量格式。当数据采用张量格式后，可以通过在 2.1节 中引入的那些张量函数来进一步操作。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "3bf0c52b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(tensor([[3., 1., 0.],\n",
       "         [2., 0., 1.],\n",
       "         [4., 0., 1.],\n",
       "         [3., 0., 1.]], dtype=torch.float64),\n",
       " tensor([127500, 106000, 178100, 140000]))"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import torch\n",
    "\n",
    "x, y = torch.tensor(inputs.values), torch.tensor(outputs.values)\n",
    "x, y"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "89a057b1",
   "metadata": {},
   "source": [
    "Note：inputs就是前面的房间号（NumRooster）、巷子（Alley）这些，而outputs就是价格price"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "065e07fc",
   "metadata": {},
   "source": [
    "#### 2.2.4. 小结\n",
    "像庞大的 Python 生态系统中的许多其他扩展包一样，pandas 可以与张量兼容。\n",
    "\n",
    "插值和删除可用于处理缺失的数据。\n",
    "\n",
    "#### 2.2.5. 练习\n",
    "创建包含更多行和列的原始数据集。\n",
    "\n",
    "删除缺失值最多的列。\n",
    "\n",
    "将预处理后的数据集转换为张量格式。\n",
    "\n",
    "\n",
    "### 作业："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "id": "965e113c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1、创建原始数据集\n",
    "import os\n",
    "p_datafile = os.path.join('..', 'data', 'house.csv')\n",
    "with open(p_datafile, 'w') as f:\n",
    "    f.write('NumRoos,Alley,Size,Garden,Price\\n')\n",
    "    f.write('NA,Pave,100,Yes,127500\\n')\n",
    "    f.write('2,NA,200,Yes,187500\\n')\n",
    "    f.write('3,NA,150,No,155500\\n')\n",
    "    f.write('NA,NA,90,NA,100500\\n')\n",
    "    f.write('4,Pave,120,Yes,137500\\n')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "id": "515a64db",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>NumRoos</th>\n",
       "      <th>Alley</th>\n",
       "      <th>Size</th>\n",
       "      <th>Garden</th>\n",
       "      <th>Price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>NaN</td>\n",
       "      <td>Pave</td>\n",
       "      <td>100</td>\n",
       "      <td>Yes</td>\n",
       "      <td>127500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>200</td>\n",
       "      <td>Yes</td>\n",
       "      <td>187500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>150</td>\n",
       "      <td>No</td>\n",
       "      <td>155500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>90</td>\n",
       "      <td>NaN</td>\n",
       "      <td>100500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4.0</td>\n",
       "      <td>Pave</td>\n",
       "      <td>120</td>\n",
       "      <td>Yes</td>\n",
       "      <td>137500</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   NumRoos Alley  Size Garden   Price\n",
       "0      NaN  Pave   100    Yes  127500\n",
       "1      2.0   NaN   200    Yes  187500\n",
       "2      3.0   NaN   150     No  155500\n",
       "3      NaN   NaN    90    NaN  100500\n",
       "4      4.0  Pave   120    Yes  137500"
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "data1 = pd.read_csv(p_datafile)\n",
    "data1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0c972f81",
   "metadata": {},
   "source": [
    "### Note：处理缺失值\n",
    "* df.isnull()#是缺失值返回True，否则范围False\n",
    "* df.isnull().sum()#返回每列包含的缺失值的个数\n",
    "* df.dropna()#直接删除含有缺失值的行\n",
    "* df.dropna(axis = 1)#直接删除含有缺失值的列\n",
    "* df.dropna(how = ‘all’)#只删除全是缺失值的行\n",
    "* df.dropna(thresh = 4)#保留至少有4个缺失值的行\n",
    "* df.dropna(subset = [‘C’])#删除含有缺失值的特定的列\n",
    "* dddf = ddf.dropna(subset=[‘jie_num’],axis=0)#删除含有缺失值的特定的行\n",
    "* datanota = AData[AData[‘marital’].notna()]#删除某列中含有缺失值的行\n",
    "\n",
    "\n",
    "* df.dropna()的Parameters说明，DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)\n",
    " * axis\t0为行 1为列，default 0，数据删除维度\n",
    " * how\t{‘any’, ‘all’}, default ‘any’，any：删除带有nan的行；all：删除全为nan的行\n",
    " * thresh\tint，保留至少 int 个非nan行\n",
    " * subset\tlist，在特定列缺失值处理\n",
    " * inplace\tbool，是否修改源文件\n",
    "\n",
    "### End"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "id": "ba647dea",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "NumRoos    2\n",
       "Alley      3\n",
       "Size       0\n",
       "Garden     1\n",
       "Price      0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data1.isna().sum()# 返回每列包含的缺失值的个数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "id": "5bff9da8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 2、删除缺失值最多的列\n",
    "data1 = data1.dropna(axis=1, thresh=max(data1.isna().sum()))\n",
    "# data1.dropna(axis=1, thresh=3)# 将在列的方向上三个为NaN的项删除"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "id": "4170070b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>NumRoos</th>\n",
       "      <th>Size</th>\n",
       "      <th>Garden</th>\n",
       "      <th>Price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3.0</td>\n",
       "      <td>100</td>\n",
       "      <td>Yes</td>\n",
       "      <td>127500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "      <td>200</td>\n",
       "      <td>Yes</td>\n",
       "      <td>187500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3.0</td>\n",
       "      <td>150</td>\n",
       "      <td>No</td>\n",
       "      <td>155500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3.0</td>\n",
       "      <td>90</td>\n",
       "      <td>NaN</td>\n",
       "      <td>100500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4.0</td>\n",
       "      <td>120</td>\n",
       "      <td>Yes</td>\n",
       "      <td>137500</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   NumRoos  Size Garden   Price\n",
       "0      3.0   100    Yes  127500\n",
       "1      2.0   200    Yes  187500\n",
       "2      3.0   150     No  155500\n",
       "3      3.0    90    NaN  100500\n",
       "4      4.0   120    Yes  137500"
      ]
     },
     "execution_count": 70,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data1 = data1.fillna(data1.mean())# 将数值的空值填充为已有数值的平均值\n",
    "data1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "id": "42b42851",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(   NumRoos  Size Garden\n",
       " 0      3.0   100    Yes\n",
       " 1      2.0   200    Yes\n",
       " 2      3.0   150     No\n",
       " 3      3.0    90    NaN\n",
       " 4      4.0   120    Yes,\n",
       " 0    127500\n",
       " 1    187500\n",
       " 2    155500\n",
       " 3    100500\n",
       " 4    137500\n",
       " Name: Price, dtype: int64)"
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "input1, output1 = data1.iloc[:, 0:3], data1.iloc[:,3]\n",
    "input1, output1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "id": "c1765275",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>NumRoos</th>\n",
       "      <th>Size</th>\n",
       "      <th>Garden_No</th>\n",
       "      <th>Garden_Yes</th>\n",
       "      <th>Garden_nan</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3.0</td>\n",
       "      <td>100</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "      <td>200</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3.0</td>\n",
       "      <td>150</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3.0</td>\n",
       "      <td>90</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4.0</td>\n",
       "      <td>120</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   NumRoos  Size  Garden_No  Garden_Yes  Garden_nan\n",
       "0      3.0   100          0           1           0\n",
       "1      2.0   200          0           1           0\n",
       "2      3.0   150          1           0           0\n",
       "3      3.0    90          0           0           1\n",
       "4      4.0   120          0           1           0"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "input1 = pd.get_dummies(input1, dummy_na=True) # 按值将Garden分为3列\n",
    "input1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "id": "d228a7f3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(tensor([[  3., 100.,   0.,   1.,   0.],\n",
       "         [  2., 200.,   0.,   1.,   0.],\n",
       "         [  3., 150.,   1.,   0.,   0.],\n",
       "         [  3.,  90.,   0.,   0.,   1.],\n",
       "         [  4., 120.,   0.,   1.,   0.]], dtype=torch.float64),\n",
       " tensor([127500, 187500, 155500, 100500, 137500]))"
      ]
     },
     "execution_count": 73,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 3、将其转换为张量格式\n",
    "import torch \n",
    "\n",
    "a, b = torch.tensor(input1.values), torch.tensor(output1.values)\n",
    "a, b"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f58fdb83",
   "metadata": {},
   "source": [
    "## torch.Tensor和torch.tensor的区别\n",
    "\n",
    "##### 在Pytorch中，Tensor和tensor都用于生成新的张量。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "id": "3696ec70",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([1., 2.])"
      ]
     },
     "execution_count": 90,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a = torch.Tensor([1, 2])\n",
    "a"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "id": "ca1d5fed",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([1, 2])"
      ]
     },
     "execution_count": 91,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a=torch.tensor([1,2])\n",
    "a"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61bc8595",
   "metadata": {},
   "source": [
    "首先我们从根源上来看看torch.Tensor()和torch.tensor()区别。\n",
    "\n",
    "torch.Tensor\n",
    "* torch.Tensor()是Python类，更明确的说，是默认张量类型torch.FloatTensor()的别名，torch.Tensor([1,2]) 会调用Tensor类的构造函数__init__，生成单精度浮点类型的张量。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "id": "90a30d99",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'torch.FloatTensor'"
      ]
     },
     "execution_count": 92,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a=torch.Tensor([1,2])\n",
    "a.type()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e2fb61e",
   "metadata": {},
   "source": [
    "##### torch.tensor()\n",
    "torch.tensor()仅仅是Python的函数，函数原型是：\n",
    "\n",
    "torch.tensor(data, dtype=None, device=None, requires_grad=False)\n",
    "\n",
    "其中data可以是：list, tuple, array, scalar等类型。\n",
    "torch.tensor()可以从data中的数据部分做拷贝（而不是直接引用），根据原始数据类型生成相应的torch.LongTensor，torch.FloatTensor，torch.DoubleTensor。\n",
    "\n",
    "### Tensor的常用基本数据类型主要有以下五种：\n",
    "\n",
    "* 32位浮点型：torch.FloatTensor。Tensor的默认数据类型。\n",
    "* 64位浮点型：torch.DoubleTensor。\n",
    "* 64位整型：torch.LongTensor。\n",
    "* 32位整型：torch.IntTensor。\n",
    "* 16位整：torch.ShortTensor。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "id": "9a9e8cd4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'torch.LongTensor'"
      ]
     },
     "execution_count": 93,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np \n",
    "a = torch.tensor([1, 2])\n",
    "a.type()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "id": "7ae90ecd",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'torch.FloatTensor'"
      ]
     },
     "execution_count": 94,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "b = torch.tensor([1., 2.])\n",
    "b.type()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "id": "ae99a923",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'torch.DoubleTensor'"
      ]
     },
     "execution_count": 95,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "c = np.zeros(2, dtype=np.float64)\n",
    "c = torch.tensor(c)\n",
    "c.type()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "id": "1479e25a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(tensor([1.4013e-45]), tensor([1.]))"
      ]
     },
     "execution_count": 96,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a, b=torch.Tensor(1), torch.Tensor([1])\n",
    "a, b"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1636b4b9",
   "metadata": {},
   "source": [
    " 前者的标量1是作为size传入的，后者的向量1是作为value传入的"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "id": "f1a60af3",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 2, 2])"
      ]
     },
     "execution_count": 97,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# astype函数用于array中数值类型转换\n",
    "x = np.array([1, 2, 2.5])\n",
    "x.astype(int)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f8519cf5",
   "metadata": {},
   "source": [
    "更多见：深入浅出之dtype( )和astype( )函数：\n",
    "\n",
    "https://blog.csdn.net/qq_32572085/article/details/85245386"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a22fa680",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
